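We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset, which simply moves a span of text from the middle of a document to its end. While this data augmentation has garnered much interest in recent years, we provide extensive evidence that training models with a large fraction of data transformed in this way does not harm the original left-to-right generative capability, as measured by perplexity and sampling evaluations across a wide range of scales. Given the usefulness, simplicity, and efficiency of training models to fill in the middle (FIM), we suggest that future autoregressive language models be trained with FIM by default. To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span. We use these ablations to prescribe strong default settings and best practices for training FIM models. We have released our best infilling model, trained with these best practices, in our API, and we release our infilling benchmarks to aid future research.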
translated by 谷歌翻译
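The transformation described in the abstract above, moving a middle span to the end of the document, can be sketched as follows. This is a minimal character-level illustration, not the authors' implementation: the sentinel strings `<PRE>`, `<SUF>`, `<MID>`, the function names, the uniform choice of cut points, and the default `fim_rate` are all assumptions made for the example.

```python
import random

# Hypothetical sentinel markers (an assumption; the abstract names none).
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def fim_transform(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    """With probability fim_rate, move a random middle span to the end."""
    if rng.random() >= fim_rate:
        return doc  # keep as ordinary left-to-right training data
    # Two sorted cut points split the document into prefix / middle / suffix.
    i, j = sorted(rng.randrange(len(doc) + 1) for _ in range(2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # The middle span is emitted last, so a left-to-right model learns to
    # generate it conditioned on both the prefix and the suffix.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"


def fim_invert(example: str) -> str:
    """Recover the original document (assumes it contains no sentinels)."""
    if not example.startswith(PRE):
        return example
    prefix, rest = example[len(PRE):].split(SUF, 1)
    suffix, middle = rest.split(MID, 1)
    return prefix + middle + suffix
```

Here `fim_rate` plays the role of the data transformation frequency mentioned in the abstract; the ordering of the three pieces and the span-selection rule are the other knobs the ablations refer to.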
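We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.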
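State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a fine-tuning baseline.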
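The test-time procedure in the GSM8K abstract above (sample many candidate solutions, keep the one the verifier ranks highest) can be sketched like this. The function name `best_of_n` and the callables `sample_solution` and `verifier` are hypothetical placeholders for a generator model and a trained verifier; nothing here reflects the paper's actual code.

```python
from typing import Callable, Tuple


def best_of_n(
    problem: str,
    sample_solution: Callable[[str], str],   # hypothetical generator
    verifier: Callable[[str, str], float],   # hypothetical scorer
    n: int,
) -> Tuple[str, float]:
    """Sample n candidate solutions and return the top-scored one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_solution(problem) for _ in range(n)]
    scores = [verifier(problem, c) for c in candidates]
    # Index of the highest verifier score (first one wins on ties).
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]
```

In practice the verifier would be a learned model that outputs an estimated probability of correctness; any function with the same signature can be plugged in to try the selection logic.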
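We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters. Stochastic gradient descent is well known to have this property at small batch sizes, via the learning rate. However, some policy optimization algorithms (such as PPO) do not have this property, because of how they control the size of policy updates. In this work, we show how to make these algorithms batch size-invariant. Our key insight is to decouple the proximal policy (used for controlling policy updates) from the behavior policy (used for off-policy corrections). Our experiments help explain why these algorithms work, and additionally show how they can make more efficient use of stale data.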
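One way to write the decoupling described in the abstract above, as a hedged sketch: the symbols $\pi_{\text{prox}}$ (proximal policy), $\pi_{\text{behav}}$ (behavior policy), $\hat{A}_t$ (advantage estimate), and $\epsilon$ (clip range) are notation introduced here and do not appear in the abstract. The clipping ratio is taken against the proximal policy, while the importance weight corrects for the behavior policy that collected the data.

```latex
L^{\text{decoupled}}(\theta) =
  \mathbb{E}_t\!\left[
    \frac{\pi_{\text{prox}}(a_t \mid s_t)}{\pi_{\text{behav}}(a_t \mid s_t)}
    \min\!\left(
      \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{prox}}(a_t \mid s_t)}\,\hat{A}_t,\;
      \operatorname{clip}\!\left(
        \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{prox}}(a_t \mid s_t)},
        1-\epsilon,\, 1+\epsilon
      \right)\hat{A}_t
    \right)
  \right]
```

When $\pi_{\text{prox}} = \pi_{\text{behav}}$, the leading weight equals 1 and the expression reduces to the standard clipped PPO objective.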
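Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings. As with other powerful technologies, safety for ML should be a leading research priority. In response to emerging safety challenges in ML, such as those introduced by recent large-scale models, we provide a new roadmap for ML safety and refine the technical problems that the field needs to address. We present four problems ready for research, namely withstanding hazards ("Robustness"), identifying hazards ("Monitoring"), steering ML systems ("Alignment"), and reducing deployment hazards ("External Safety"). Throughout, we clarify each problem's motivation and provide concrete research directions.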
We introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning. We believe that the community will benefit from increased access to high quality training environments, and we provide detailed experimental protocols for using this benchmark. We empirically demonstrate that diverse environment distributions are essential to adequately train and evaluate RL agents, thereby motivating the extensive use of procedural content generation. We then use this benchmark to investigate the effects of scaling model size, finding that larger models significantly improve both sample efficiency and generalization.
This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound of the mutual information objective that can be optimized efficiently. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing supervised methods.
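The lower bound mentioned in the InfoGAN abstract above can be written as follows. The notation is introduced here and is not in the abstract: $c$ is the structured latent code, $z$ the remaining noise, $Q(c \mid x)$ an auxiliary distribution approximating the posterior, $V(D, G)$ the usual GAN value function, and $\lambda$ a weighting hyperparameter.

```latex
\begin{aligned}
L_I(G, Q) &= \mathbb{E}_{c \sim P(c),\; x \sim G(z, c)}\bigl[\log Q(c \mid x)\bigr] + H(c)
  \;\le\; I\bigl(c;\, G(z, c)\bigr), \\
\min_{G,\,Q}\,\max_{D}\; V_{\text{InfoGAN}}(D, G, Q) &= V(D, G) - \lambda\, L_I(G, Q).
\end{aligned}
```

Because $L_I$ only requires sampling codes and generated images and evaluating $\log Q$, it can be optimized alongside the ordinary GAN objective without computing the true posterior over $c$.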
Recently, researchers have made significant progress combining the advances in deep learning for learning feature representations with reinforcement learning. Some notable examples include training agents to play Atari games based on raw pixel data and to acquire advanced manipulation skills using raw sensory inputs. However, it has been difficult to quantify progress in the domain of continuous control due to the lack of a commonly adopted benchmark. In this work, we present a benchmark suite of continuous control tasks, including classic tasks like cart-pole swing-up, tasks with very high state and action dimensionality such as 3D humanoid locomotion, tasks with partial observations, and tasks with hierarchical structure. We report novel findings based on the systematic evaluation of a range of implemented reinforcement learning algorithms. Both the benchmark and reference implementations are released at https://github.com/rllab/rllab in order to facilitate experimental reproducibility and to encourage adoption by other researchers.
Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks. The two main challenges are the large number of samples typically required, and the difficulty of obtaining stable and steady improvement despite the nonstationarity of the incoming data. We address the first challenge by using value functions to substantially reduce the variance of policy gradient estimates at the cost of some bias, with an exponentially-weighted estimator of the advantage function that is analogous to TD(λ). We address the second challenge by using a trust region optimization procedure for both the policy and the value function, which are represented by neural networks. Our approach yields strong empirical results on highly challenging 3D locomotion tasks, learning running gaits for bipedal and quadrupedal simulated robots, and learning a policy for getting the biped to stand up from starting out lying on the ground. In contrast to a body of prior work that uses hand-crafted policy representations, our neural network policies map directly from raw kinematics to joint torques. Our algorithm is fully model-free, and the amount of simulated experience required for the learning tasks on 3D bipeds corresponds to 1-2 weeks of real time.
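The exponentially-weighted advantage estimator mentioned in the abstract above can be sketched for a single trajectory as below. This is an illustrative sketch, not the authors' code: the function name, the discount `gamma`, the weighting parameter `lam`, their default values, and the convention that `values` carries one extra bootstrap entry are all assumptions made for the example.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    gamma: float = 0.99,   # assumed discount factor
    lam: float = 0.95,     # assumed exponential weighting parameter
) -> List[float]:
    """Exponentially-weighted sum of TD residuals for one trajectory.

    `values` must have len(rewards) + 1 entries; the last one is the
    value estimate of the state after the final reward (0.0 if terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Accumulate residuals with weight (gamma * lam) per step.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

Setting `lam=0` gives the one-step TD residual (low variance, more bias), while `lam=1` gives the discounted return minus the value baseline (less bias, higher variance), which is the trade-off the abstract refers to.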
We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement. By making several approximations to the theoretically-justified procedure, we develop a practical algorithm, called Trust Region Policy Optimization (TRPO). This algorithm is similar to natural policy gradient methods and is effective for optimizing large nonlinear policies such as neural networks. Our experiments demonstrate its robust performance on a wide variety of tasks: learning simulated robotic swimming, hopping, and walking gaits; and playing Atari games using images of the screen as input. Despite its approximations that deviate from the theory, TRPO tends to give monotonic improvement, with little tuning of hyperparameters.
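The practical optimization problem behind TRPO, described in the abstract above, is commonly stated as the constrained update below. The notation is introduced here and does not appear in the abstract: $\pi_{\theta_{\text{old}}}$ is the current policy, $A_{\theta_{\text{old}}}$ its advantage function, and $\delta$ the trust region size.

```latex
\begin{aligned}
\max_{\theta}\quad
  & \mathbb{E}_{s,\,a \sim \pi_{\theta_{\text{old}}}}\!\left[
      \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\,
      A_{\theta_{\text{old}}}(s, a)
    \right] \\
\text{subject to}\quad
  & \mathbb{E}_{s \sim \pi_{\theta_{\text{old}}}}\!\left[
      D_{\mathrm{KL}}\bigl(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\bigr)
    \right] \le \delta .
\end{aligned}
```

The average-KL constraint is one of the approximations to the theoretically justified procedure: it bounds how far each update can move the policy while remaining practical to estimate from samples.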